Introduction

This report analyzes the City Lifestyle Segmentation dataset to:

  1. Identify lifestyle-based clusters of world cities using PCA and clustering.
  2. Compare developed vs. developing regions.
  3. Explore the relationship between economic and environmental factors.

Load Data

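# Packages assumed to be attached in a setup chunk that is not shown:
# ggplot2, dplyr, ggcorrplot, plotly, rnaturalearth
data_path <- "../data/city_lifestyle.csv"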
city <- readr::read_csv(data_path)
## Rows: 300 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): city_name, country
## dbl (8): population_density, avg_income, internet_penetration, avg_rent, air...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick summary
summary(city)
##   city_name           country          population_density   avg_income  
##  Length:300         Length:300         Min.   :  100      Min.   : 480  
##  Class :character   Class :character   1st Qu.: 1830      1st Qu.:1908  
##  Mode  :character   Mode  :character   Median : 3084      Median :2810  
##                                        Mean   : 3945      Mean   :2827  
##                                        3rd Qu.: 4824      3rd Qu.:3752  
##                                        Max.   :14427      Max.   :5720  
##  internet_penetration    avg_rent    air_quality_index public_transport_score
##  Min.   : 34.00       Min.   : 170   Min.   : 22.00    Min.   :15.00         
##  1st Qu.: 64.40       1st Qu.: 640   1st Qu.: 54.00    1st Qu.:46.08         
##  Median : 75.00       Median : 990   Median : 67.50    Median :54.70         
##  Mean   : 74.31       Mean   :1003   Mean   : 71.25    Mean   :55.72         
##  3rd Qu.: 87.22       3rd Qu.:1332   3rd Qu.: 86.00    3rd Qu.:64.20         
##  Max.   :100.00       Max.   :2430   Max.   :146.00    Max.   :95.00         
##  happiness_score green_space_ratio
##  Min.   :2.500   Min.   : 2.00    
##  1st Qu.:5.300   1st Qu.:28.23    
##  Median :6.900   Median :34.70    
##  Mean   :6.644   Mean   :33.99    
##  3rd Qu.:8.500   3rd Qu.:40.40    
##  Max.   :8.500   Max.   :58.00

Exploratory Data Analysis (EDA)

Data check

cat("Number of cities:", nrow(city), "\n")
## Number of cities: 300
cat("Number of features:", ncol(city), "\n")
## Number of features: 10
missing_summary <- colSums(is.na(city))
missing_summary[missing_summary > 0]
## named numeric(0)

Numeric variables distribution

city_num <- city |> dplyr::select(where(is.numeric))

city_num |>
  tidyr::pivot_longer(cols = everything(), names_to = "variable", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, fill = "#2C79B8", alpha = 0.7) +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric City Features",
       x = "Value",
       y = "Count")

Correlation matrix of city lifestyle features

corr_mat <- cor(city_num, use = "pairwise.complete.obs")

ggcorrplot(corr_mat,
           method = "square",
           type   = "lower",
           lab    = FALSE,
           outline.color = "white",
           show.legend   = TRUE) +
  labs(title = "Correlation Matrix of City Lifestyle Features")

Several meaningful relationships are observed:

  • avg_income and avg_rent show a strong positive correlation, indicating that wealthier cities tend to have higher living costs.
  • happiness_score is moderately positively correlated with both income and green_space_ratio, suggesting better wellbeing in greener and richer cities.
  • air_quality_index is negatively correlated with green space and income, implying that pollution is more severe in less affluent and less green cities.
  • public_transport_score shows a slight positive association with population density, which may reflect greater investment in transportation infrastructure in densely populated cities.

These patterns highlight how economic development, environmental quality, and public services jointly shape the lifestyle structure of cities.
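To put numbers behind the relationships listed above, the correlation matrix can be flattened into a ranked list of variable pairs. The following is a minimal base R sketch: `top_correlations()` is a hypothetical helper that is not part of the original analysis, and it is demonstrated on the built-in mtcars data because the city dataset is not bundled here. For this report the call would be `top_correlations(city)`.

```r
# Rank variable pairs by absolute Pearson correlation.
# `top_correlations` is a hypothetical helper introduced for illustration.
top_correlations <- function(df, n = 5) {
  # Keep numeric columns only, mirroring select(where(is.numeric))
  num <- df[vapply(df, is.numeric, logical(1))]
  cm  <- cor(num, use = "pairwise.complete.obs")

  # Indices of the lower triangle, so each pair appears exactly once
  idx <- which(lower.tri(cm), arr.ind = TRUE)

  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )

  # Strongest relationships first, regardless of sign
  out <- out[order(-abs(out$r)), ]
  rownames(out) <- NULL
  head(out, n)
}

# Demo on built-in data (an assumption made so the sketch runs stand-alone)
top_pairs <- top_correlations(mtcars, n = 3)
print(top_pairs)
```

Sorting by absolute value keeps strong negative relationships, such as the one between air_quality_index and income described above, near the top of the list.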

Population density vs. happiness score

ggplot(city, aes(x = population_density, y = happiness_score)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Happiness vs Population Density",
       x = "Population density",
       y = "Happiness score")
## `geom_smooth()` using formula = 'y ~ x'

Cities with higher population density tend to report lower happiness levels on average.

  • This suggests that overcrowding may reduce quality of life due to factors such as congestion, limited public space, noise, and increased stress.
  • However, the considerable vertical spread of points also indicates that happiness is influenced by additional factors beyond density, such as economic resources and environmental quality.
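The direction of the trend and the amount of scatter around it can both be quantified. The sketch below is one possible way to do so, not the method used in this report: `trend_summary()` is a hypothetical helper that combines a Pearson correlation test with the same linear fit that `geom_smooth(method = "lm")` draws. It is shown on the built-in mtcars data so that it runs on its own; for this report the call would be `trend_summary(city$population_density, city$happiness_score)`.

```r
# Summarise a bivariate linear trend.
# `trend_summary` is a hypothetical helper introduced for illustration.
trend_summary <- function(x, y) {
  ct  <- cor.test(x, y, method = "pearson")  # correlation and its p-value
  fit <- lm(y ~ x)                           # same model as geom_smooth(method = "lm")
  list(
    r         = unname(ct$estimate),
    p_value   = ct$p.value,
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared
  )
}

# Demo on built-in data (an assumption made so the sketch runs stand-alone)
res <- trend_summary(mtcars$wt, mtcars$mpg)
str(res)
```

A negative slope with a low R-squared would match the reading above: a downward trend on average, with most of the variation in happiness left to other factors.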

Air quality vs. income

ggplot(city,
       aes(x = avg_income,
           y = air_quality_index,
           size = green_space_ratio)) +
  geom_point(alpha = 0.7, color = "#1B7F79") +
  theme_minimal() +
  scale_size_continuous(name = "Green space ratio") +
  labs(title = "Air Quality vs Income (Bubble size = Green Space)",
       x = "Average income",
       y = "Air quality index")

A noticeable downward trend emerges: higher-income cities generally exhibit lower air quality index values, which indicates better air quality because a lower index means less pollution.

In addition, larger bubbles are more frequently observed among cities with higher incomes, suggesting that wealthier cities also tend to feature more green infrastructure.

Combined, these results imply that economic resources enable improved environmental management, including investments in green spaces and pollution control.

Happiness across different countries

ggplot(city, aes(x = country, y = happiness_score, fill = country)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Happiness Score by Country",
       x = "Country",
       y = "Happiness score") +
  theme(legend.position = "none")

The country column holds region-level labels, so the boxplot compares world regions:

  • Europe and North America show the highest median happiness scores, with relatively small variability, indicating consistently high quality of life in these developed regions.
  • Oceania also performs strongly, though its sample size appears smaller.
  • Asia and South America have moderate happiness levels with wider spreads, suggesting uneven development and quality-of-life disparities within regions.
  • Africa exhibits the lowest median happiness score, reflecting challenges related to economic development and social wellbeing.

Clustering Analysis

PCA

Select numeric variables and standardize the data

city_num <- city |> dplyr::select(where(is.numeric))
city_scaled <- scale(city_num)
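Standardization matters here because the variables sit on very different scales: in the summary above, population_density reaches 14427 while happiness_score tops out at 8.5, so an unscaled PCA would be dominated by the high-variance columns. As a quick sanity check of what `scale()` produces, the sketch below uses three columns of the built-in mtcars data (an assumption made so it runs without the city file) and confirms that each column ends up with mean 0 and standard deviation 1.

```r
# Demo columns from a built-in dataset; for the report, use city_num instead
x <- scale(mtcars[, c("mpg", "disp", "hp")])

col_means <- colMeans(x)       # should all be numerically zero
col_sds   <- apply(x, 2, sd)   # should all be one

print(round(col_means, 10))
print(round(col_sds, 10))
```

Because the data are already standardized at this point, calling `prcomp()` with `scale. = FALSE` afterwards is equivalent to running it on the raw data with `scale. = TRUE`.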

Run PCA

pca_res <- prcomp(city_scaled, scale. = FALSE)

pca_var <- pca_res$sdev^2
pca_var_ratio <- pca_var / sum(pca_var)

pca_var_ratio
## [1] 0.538058042 0.258331158 0.073279421 0.054527945 0.036307051 0.024752029
## [7] 0.008892435 0.005851919

The first principal component accounts for 53.8% of the variance, while the second accounts for an additional 25.8%. Together, PC1 and PC2 explain 79.6% of the total variance.

Scree Plot of PCA

plot(pca_var_ratio,
     xlab = "Principal Component",
     ylab = "Variance Explained",
     main = "Scree Plot of PCA",
     type = "b")

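The first four components together explain about 92.4% of the total variance. We therefore retain them for clustering, which gives a reasonable balance between information preservation and dimensionality reduction; the plots that follow use only the first two or three components.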
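The choice of how many components to keep can be checked against the cumulative variance. The sketch below re-enters the ratios printed above by hand so that it runs on its own; the 90% threshold is an assumption used for illustration, not a rule stated in this report.

```r
# Variance ratios copied from the pca_var_ratio output above
pca_var_ratio <- c(0.538058042, 0.258331158, 0.073279421, 0.054527945,
                   0.036307051, 0.024752029, 0.008892435, 0.005851919)

# Running total of explained variance, one entry per component
cum_var <- cumsum(pca_var_ratio)
print(round(cum_var, 3))

# Smallest number of components reaching the (assumed) 90% threshold
threshold <- 0.90
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)
```

Under that threshold `n_keep` is 4: two components reach about 79.6%, three stay below 90%, and four reach about 92.4%.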

PCA Projection (PC1 vs PC2)

pca_scores <- as.data.frame(pca_res$x[,1:2])
colnames(pca_scores) <- c("PC1", "PC2")

ggplot(pca_scores, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.7, color = "#006699") +
  theme_minimal() +
  labs(title = "Cities in PCA Space",
       x = "PC1", y = "PC2")

K-Means Clustering

Perform K-means clustering on the first 4 PCs

set.seed(42)  # fix the random starts so the cluster assignment is reproducible

pca_for_cluster <- as.data.frame(pca_res$x[,1:4]) # use the first four components
k3 <- kmeans(pca_for_cluster, centers = 3, nstart = 25)

city$cluster <- factor(k3$cluster)
pca_scores$cluster <- factor(k3$cluster)

table(city$cluster)
## 
##   1   2   3 
##  86 142  72
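The code above fixes the number of clusters at three without showing how that value was chosen. One common check, offered here as an assumption rather than as the procedure behind this report, is the elbow heuristic: run K-means for a range of k and look for the point where the total within-cluster sum of squares stops dropping sharply. `wss_by_k()` is a hypothetical helper, demonstrated on the standardized built-in iris measurements; for this report the input would be `pca_for_cluster`.

```r
# Total within-cluster sum of squares for k = 1..k_max.
# `wss_by_k` is a hypothetical helper introduced for illustration.
wss_by_k <- function(x, k_max = 8, nstart = 25, seed = 42) {
  set.seed(seed)  # reproducible random starts
  vapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  }, numeric(1))
}

# Demo input (an assumption made so the sketch runs stand-alone)
x   <- scale(iris[, 1:4])
wss <- wss_by_k(x, k_max = 6)

# Elbow plot: look for the k after which the curve flattens
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow check for K-means")
```

With k = 1 the within-cluster sum of squares equals the total sum of squares of the data, which gives the curve a known starting point to compare against.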

Cluster visualization in PCA Space (2D)

ggplot(pca_scores, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.8, size = 3) +
  theme_minimal() +
  labs(title = "City Lifestyle Clusters in PCA Space",
       color = "Cluster")

Cluster visualization in PCA Space (3D)

pca_3d <- as.data.frame(pca_res$x[,1:3])
colnames(pca_3d) <- c("PC1", "PC2", "PC3")
pca_3d$cluster <- city$cluster

plot_ly(pca_3d,
        x = ~PC1,
        y = ~PC2,
        z = ~PC3,
        color = ~cluster,
        colors = c("#1B7F79", "#D95F02", "#7570B3"),
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 5, opacity = 0.85)) %>%
  layout(
    title = "3D PCA Clustering of City Lifestyles",
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    )
  )

Cluster profile table

cluster_profile <- city |>
  group_by(cluster) |>
  summarise(across(where(is.numeric),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}"))

cluster_profile

Cluster Interpretation:

Cluster 1 - Affordable but Lower-Quality

  • Moderate population density
  • Very low income
  • Poor internet access & public transportation
  • Lower happiness
  • Limited green space

Cluster 2 - High-Quality and Wealthy Cities

  • Lower population density (compared to Cluster 3)
  • Highest income and rent
  • Strong public transportation
  • Highest happiness
  • Good green space ratios

Cluster 3 - Urban-Pressure Cities

  • Extremely high population density
  • Medium income & internet penetration
  • High public transport scores
  • Lower air quality
  • Lower green space availability

Create a world map to visualize the three clusters

world <- ne_countries(scale = "medium", returnclass = "sf")

world_cont <- world |>
  dplyr::group_by(continent) |>
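  dplyr::summarise(geometry = sf::st_union(geometry))

# Most frequent cluster within each region (the country column holds region labels)
region_cluster <- city |>
  dplyr::count(country, cluster) |>
  dplyr::group_by(country) |>
  dplyr::slice_max(n, n = 1, with_ties = FALSE) |>
  dplyr::ungroup()

region_cluster

# Named map_df rather than map_data to avoid masking ggplot2::map_data()
map_df <- world_cont |>
  dplyr::left_join(region_cluster,
                   by = c("continent" = "country"))

ggplot(map_df) +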
  geom_sf(aes(fill = cluster), color = "white") +
  scale_fill_brewer(palette = "Set2", na.value = "grey90") +
  theme_minimal() +
  labs(title = "Dominant City Lifestyle Cluster by World Region",
       fill  = "Cluster")